Introduction

Welcome! The purpose of this project is to produce an algorithm that can help the online real-estate website Zillow.com predict home sale prices with greater accuracy. Improving real-estate valuations is important for several reasons. Namely, it:

  1. Improves user experience
  2. Provides context for anticipated property taxes and homeowner’s insurance
  3. Alleviates historic and systemic biases commonplace with home valuations in neighborhoods of color
  4. Supports informed decision-making among families regarding long-term investing, since buying a home is an investment

The gains from such an algorithm are worthwhile; however, this project proved to be a challenging exercise. Namely, it is incredibly difficult to quantify the features that influence home values without leaning into historical disparities (homes in ‘good’ versus ‘bad’ areas are often paired with racial, class, and amenity disparities). Likewise, the presence of collinear variables may affect the performance of our model.

We employed the hedonic model (HM) for this project. Simply put, HM refers to a regression model that estimates the influence various factors have on the price of a good (i.e., a home) and is commonly used in real-estate pricing. In this case, home price served as our dependent variable, while various features served as our independent variables. We identified a collection of both physical and demographic features we believed influence the price of a home, cleaned these datasets, and tested their significance for home valuation. Furthermore, we checked our results through cross-validation testing and employed statistical metrics such as mean absolute error (MAE), mean absolute percentage error (MAPE), and Moran’s I.
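As a toy illustration of the hedonic idea (all numbers below are made up for illustration and are not drawn from our data):

```r
# Toy hedonic regression: price modeled as a function of structural features.
# The data frame below is entirely hypothetical.
toy <- data.frame(
  price = c(250000, 310000, 420000, 180000, 520000, 295000),
  sqft  = c(1400, 1800, 2600, 1100, 3100, 1700),
  baths = c(2, 2, 3, 1, 4, 2)
)

# Each coefficient estimates the marginal contribution of that feature to price
toy_hm <- lm(price ~ sqft + baths, data = toy)
summary(toy_hm)
```

Our actual model follows the same structure, just with many more internal, amenity, and spatial variables on the right-hand side.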

Ultimately, we produced a functional model, though with a major weakness: it is not generalizable across majority-white and majority-non-white neighborhoods. Specifically, we saw an increase in our MAPE within majority-non-white neighborhoods, as well as lower average home valuations. We review the reasons for this later.

Nonetheless, this project was a great exercise. We look forward to you reviewing our model and code!
Click “code” on the right-hand side to view what takes place behind the scenes.

Data Wrangling

To gather data, our team focused on Mecklenburg County, NC (Charlotte Metro Area) and sourced information from the county’s open data website, as well as the American Community Survey and the U.S. Census.

Charlotte Home Sales Data

To begin, we will import a home sales dataset that includes variables like location, housing characteristics, and home quality for the Charlotte Metro Area. We will then ‘clean’ the data by creating useful columns: “building grade” as a numeric value (where higher values correspond to greater quality), “age of home” (age), “price per square foot” (pricesqft), and the number of “total baths” (totbaths), calculated by combining full- and half-bathroom counts. Moving forward, we’ll refer to this home sales data as “internal variables.”

#Import sales data, remove columns we won't need, add useful columns, set CRS for North Carolina
library(dplyr)
library(sf)
CLT_internal <- 
  st_read("https://github.com/mafichman/MUSA_508_Lab/raw/main/Midterm/data/2022/studentData.geojson") %>%
  st_transform('ESRI:103500')

CLT_internal <- CLT_internal[c(5,9,20,21,26,28,30:46,57:60,67,68,70,71,72)] %>%
  mutate(
    age = 2022 - yearbuilt,
    sqft = totalac * 43560,
    pricesqft = price / (totalac * 43560),
    totbaths = (halfbaths * 0.5) + fullbaths)

# Convert building grade to a numeric quality score
# (note the backticks: the grade is recorded as "VERY GOOD", with a space)
CLT_internal$quality <- recode(CLT_internal$bldggrade,
                               MINIMUM = 1, FAIR = 2, AVERAGE = 3, GOOD = 4,
                               `VERY GOOD` = 5, EXCELLENT = 6, CUSTOM = 7)

Adding Amenities Data

To build a strong model, it’s important to include variables that relate to the housing market, such as local schools, grocery stores, and parks. We’ll refer to these variables as “amenities.”

# Adding school data 
CLT_schools <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Schools.geojson") %>%
  st_transform(st_crs(CLT_internal))

# Adding grocery store data
CLT_grocery <- 
  st_read("Grocery_pts.geojson") %>%
  st_transform(st_crs(CLT_internal))

# Adding parks data 
CLT_parks <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Parks.geojson") %>%
  st_transform(st_crs(CLT_internal))

Adding Spatial Structure Data

Finally, we will add variables that provide demographic and environmental data for the Charlotte Metro Area. Specifically, we will include educational attainment, household income, and neighborhoods data. We’ll refer to these variables as “spatial structure.”

# Adding demographic data (get_acs() is from the tidycensus package)
library(tidycensus)
CLT_demo <- 
  get_acs(geography = "tract", 
          variables = c("B19013_001E", "B15003_022E","B15003_001E"), 
          year=2020, state=37, county=119, 
          geometry=TRUE, output="wide") %>%
  st_transform(st_crs(CLT_internal)) %>%
  dplyr::select( -NAME, -B19013_001M, -B15003_022M, -B15003_001M)

CLT_demo <-
  CLT_demo %>%
  rename(HH_inc = B19013_001E, 
         College = B15003_022E,
         College_age_pop = B15003_001E) %>%
  mutate(college_perc = College/College_age_pop) %>%
  dplyr::select( -College, -College_age_pop)

# Adding neighborhood data 
CLT_neighborhoods <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/School_districts.geojson") %>%
  st_transform(st_crs(CLT_internal))

Exploratory Data Analysis

Orienting Our Variables

So far, we have added internal, amenities, and spatial structure variables. However, in order to build our model and analyze how these variables relate to home sales, we must modify them. We’ll achieve this using two techniques:

1. K-nearest neighbor (KNN): this will find the distance between a given home and the nearest amenities (school, grocery store, park).

2. Spatial join (SJ): this will join our spatial structure data (educational attainment, neighborhoods) to our internal variables (Charlotte home sales)
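The KNN step relies on the nn_function helper from the course scripts; below is a minimal sketch of what it does, assuming the FNN package is available (the course version may differ in details):

```r
# Sketch of an nn_function-style helper (assumption: the course helper
# behaves similarly). For each point in measureFrom, return the mean
# distance to its k nearest neighbors in measureTo.
library(FNN)

nn_function <- function(measureFrom, measureTo, k) {
  # get.knnx(data, query, k) returns an n x k matrix of distances
  nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
  rowMeans(nn)  # average distance over the k neighbors
}
```

With k = 1, as below, this reduces to the distance from each home to its single nearest school, grocery store, or park.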

Note to instructor: the nn_function did not work as expected, perhaps because the geometry contains multiple points (rather than a single X and Y), so we took the centroid of each feature.

# Nearest school, grocery store, and park 
CLT_internal <-
  CLT_internal %>% 
  mutate(
    school_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)),
                             st_coordinates(st_centroid(CLT_schools)), k = 1),
    grocery_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)),
                              st_coordinates(st_centroid(CLT_grocery)), k = 1),
    park_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)),
                           st_coordinates(st_centroid(CLT_parks)), k = 1))

# Spatial join 
CLT_internal <- 
  st_join(CLT_internal,CLT_demo) 

CLT_internal <- 
  st_join(CLT_internal, CLT_neighborhoods)

Summary Statistics

Below are summary statistics tables for each variable category (internal, amenities, spatial structure).

#Internal variables 
ss_internal <- CLT_internal %>%
  st_drop_geometry() %>%
  dplyr::select(sqft, pricesqft, totbaths, yearbuilt)
stargazer(as.data.frame(ss_internal), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Internal Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Internal Variables
## =======================================================
## Statistic   N      Mean    St. Dev.   Min      Max     
## -------------------------------------------------------
## sqft      46,183 77,176.9 4,648,606.0 0.0 425,034,086.0
## pricesqft 46,138  Inf.0               0.0     Inf.0    
## totbaths  46,183   2.6        0.9     0.0     11.0     
## yearbuilt 46,183 1,993.4     31.8      0      2,022    
## -------------------------------------------------------
#Amenities 
ss_amenities <- CLT_internal %>%
  st_drop_geometry() %>%
  dplyr::select(school_nn1, grocery_nn1, park_nn1)
stargazer(as.data.frame(ss_amenities), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables
## ================================================
## Statistic     N     Mean   St. Dev. Min    Max  
## ------------------------------------------------
## school_nn1  46,183 3,266.8 1,390.4  8.2  8,369.4
## grocery_nn1 46,183 1,737.8 1,274.4  37.0 8,599.7
## park_nn1    46,183 1,239.8  767.7   21.0 5,026.3
## ------------------------------------------------
# Spatial Structure
ss_spatial <- CLT_internal %>%
  st_drop_geometry() %>%
  dplyr::select(HH_inc, college_perc, FID)
stargazer(as.data.frame(ss_spatial), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables
## ====================================================
## Statistic      N      Mean   St. Dev.  Min     Max  
## ----------------------------------------------------
## HH_inc       46,180 86,175.7 37,415.7 17,685 238,750
## college_perc 46,180   0.3      0.1     0.03    0.6  
## FID          46,183   26.0     10.6     0      43   
## ----------------------------------------------------

Correlation Matrix

Below is a table visualizing correlation between our variables. We can see that home price maintains a positive correlation with the following variables (in order of strength):

  • heated square footage of the home
  • home quality
  • total bathrooms
  • household income
  • percentage of residents with college degrees
  • bedrooms
  • number of fireplaces in the home

We can use this matrix to inform our variable selection for our model.

numericVars <- select_if(st_drop_geometry(CLT_internal), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("deepskyblue", "grey100", "firebrick1"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation Matrix of Numeric Variables", tl.cex = 0.5, tl.col = "black") +
  theme(axis.text.x = element_text(size = 10)) +
  plotTheme()

Scatterplots

Below are four home-price scatterplots, based on the results of our correlation matrix:

st_drop_geometry(CLT_internal) %>% 
  dplyr::select(price, quality, heatedarea, HH_inc, yearbuilt) %>% 
    filter(price < 10000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "hotpink") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Internal and Spatial Variables") +
  theme(axis.text.x = element_text(size=7)) +
  plotTheme()

Maps

Below are 4 maps including:

1. A map of our dependent variable (price)

2. A map of park locations

3. A map of nearby grocery stores

4. A map of school quality

Note: the first three maps are shown in relation to home prices within the Charlotte Metro Area.

grid.arrange(ncol=2,
ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Price", subtitle="Charlotte Metro Area") +
  labs(color = "Observed Sales Price (quintiles)") +
   theme(plot.title=element_text(size=14, face='bold'),
         legend.key.size = unit(1.5, 'cm'), #change legend key size
        legend.key.height = unit(1, 'cm'), #change legend key height
        legend.key.width = unit(1, 'cm'), #change legend key width
        legend.title = element_text(size=14), #change legend title font size
        legend.text = element_text(size=10)) + #change legend text font size
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey70") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = CLT_parks, color="darkgreen") +
  labs(title="Park Locations vs Home Price", subtitle="Charlotte Metro Area") + 
  theme(plot.title=element_text(size=14, face='bold'),
        legend.key.size = unit(1.5, 'cm'), 
        legend.key.height = unit(1, 'cm'),
        legend.key.width = unit(1, 'cm'), 
        legend.title = element_text(size=14), 
        legend.text = element_text(size=10)) + 
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey30") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = CLT_grocery, color="deepskyblue") +
  labs(title="Grocery Store Locations vs Home Price", subtitle="Charlotte Metro Area") + 
  theme(plot.title=element_text(size=14, face='bold'),
    legend.key.size = unit(1.5, 'cm'), 
        legend.key.height = unit(1, 'cm'),
        legend.key.width = unit(1, 'cm'), 
        legend.title = element_text(size=14), 
        legend.text = element_text(size=10)) + 
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_schools, aes(fill=factor(Quality), color=factor(Quality))) +
  scale_fill_brewer(palette="RdYlGn") +
  scale_color_brewer(palette="RdYlGn") +
  labs(title="School Quality", subtitle="Niche.com ratings; Charlotte Metro Area") +
   theme(plot.title=element_text(size=14, face='bold'),
     legend.key.size = unit(1.5, 'cm'), 
        legend.key.height = unit(1, 'cm'), 
        legend.key.width = unit(1, 'cm'),
        legend.title = element_text(size=14), 
        legend.text = element_text(size=10)) + 
  mapTheme())

Regression Model

Splitting the Data

Now that we have identified important variables, we will build our model by dividing the data into training and test sets, partitioned as a 60/40 train-test split. We’ll then run a regression on the training set (60%) and use the results to assess generalizability on the test set (40%).

However, before dividing the dataset we will create categorical variables for the number of floors, bathrooms, and bedrooms for a given home.

#Re-engineering data as categorical: number of floors
CLT_internal<- 
  CLT_internal%>%
  mutate(NUM_FLOORS.cat = ifelse(storyheigh == "1 STORY" | storyheigh == "1.5 STORY" | storyheigh == "SPLIT LEVEL" | storyheigh == "2.0 STORY", "Up to 2 Floors",
               ifelse(storyheigh == "2.5 STORY" | storyheigh == "3.0 STORY", "Up to 3 Floors", "4+ Floors")))
#Re-engineer bedroom as categorical
CLT_internal <- 
  CLT_internal %>%
  mutate(NUM_BEDS.cat = ifelse(bedrooms <= 2, "Up to 2 Bedrooms",
                               ifelse(bedrooms == 3 | bedrooms == 4, "Up to 4 Bedrooms", "5+ Bedrooms")))
#Re-engineer bathroom data as categorical
CLT_internal <- 
  CLT_internal %>%
  mutate(NUM_BATHS.cat = ifelse(totbaths <= 2.5, "Up to 2.5 Bathrooms",
                                ifelse(totbaths <= 4.5, "Up to 4 Bathrooms", "5+ Bathrooms")))
#Creating training data (createDataPartition() is from the caret package)
library(caret)
inTrain <- createDataPartition(
              y = paste(CLT_internal$NUM_FLOORS.cat, CLT_internal$NUM_BEDS.cat), 
              p = .60, list = FALSE)
charlotte.training <- CLT_internal[inTrain,] 
charlotte.test <- CLT_internal[-inTrain,]  

reg.training <- lm(price ~ ., data = st_drop_geometry(charlotte.training) %>% 
                                    dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, 
                                               park_nn1, grocery_nn1,
                                               age, HH_inc, college_perc))
summary(reg.training)
## 
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(charlotte.training) %>% 
##     dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat, 
##         NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age, 
##         HH_inc, college_perc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1588349   -84853   -10729    59231 43743238 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                      -260650.9662   40859.6608  -6.379
## heatedarea                           134.2995       4.5987  29.204
## quality                           194599.2921    5172.7628  37.620
## NUM_FLOORS.catUp to 2 Floors       48508.9091   22036.6104   2.201
## NUM_FLOORS.catUp to 3 Floors       -7429.1503   28550.7081  -0.260
## NUM_BEDS.catUp to 2 Bedrooms       46547.6155   15832.3511   2.940
## NUM_BEDS.catUp to 4 Bedrooms       36982.9267   10437.9678   3.543
## NUM_BATHS.catUp to 2.5 Bathrooms -430633.8824   25321.7332 -17.006
## NUM_BATHS.catUp to 4 Bathrooms   -436954.6889   23004.4329 -18.994
## park_nn1                              -8.4088       3.4074  -2.468
## grocery_nn1                          -20.9651       2.2827  -9.184
## age                                 2150.0810     123.2861  17.440
## HH_inc                                 0.7120       0.1182   6.022
## college_perc                       55718.5589   33367.8905   1.670
##                                              Pr(>|t|)    
## (Intercept)                            0.000000000181 ***
## heatedarea                       < 0.0000000000000002 ***
## quality                          < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors                 0.027725 *  
## NUM_FLOORS.catUp to 3 Floors                 0.794705    
## NUM_BEDS.catUp to 2 Bedrooms                 0.003285 ** 
## NUM_BEDS.catUp to 4 Bedrooms                 0.000396 ***
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms   < 0.0000000000000002 ***
## park_nn1                                     0.013600 *  
## grocery_nn1                      < 0.0000000000000002 ***
## age                              < 0.0000000000000002 ***
## HH_inc                                 0.000000001746 ***
## college_perc                                 0.094966 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 403900 on 25546 degrees of freedom
##   (2154 observations deleted due to missingness)
## Multiple R-squared:  0.3015, Adjusted R-squared:  0.3011 
## F-statistic: 848.1 on 13 and 25546 DF,  p-value: < 0.00000000000000022

Evaluating Generalizability # 1

To test the strength (accuracy) of our model’s price predictions, we will:

  1. Find the mean absolute error (MAE) and mean absolute percentage error (MAPE)
  2. Conduct cross-validation tests
  3. Plot predicted prices as a function of observed prices
  4. Map our test-set residuals, including a Moran’s I test and a plot of the spatial lag in errors
  5. Map mean absolute percentage error (MAPE) by neighborhood
  6. Create a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood

MAE & MAPE
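Before computing these on our test set, here is a toy illustration of the two metrics on made-up observed and predicted prices (APE is taken relative to the observed price):

```r
# Toy MAE/MAPE calculation on hypothetical prices
observed  <- c(200000, 350000, 500000)
predicted <- c(210000, 330000, 550000)

MAE  <- mean(abs(predicted - observed))             # (10000+20000+50000)/3, about 26,666.7
MAPE <- mean(abs(predicted - observed) / observed)  # mean(0.05, 0.0571, 0.10), about 0.069
```

MAE reports the typical dollar error, while MAPE reports the typical error as a share of the home's price, which makes it easier to compare across cheap and expensive neighborhoods.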

#Creating predictions and calculating Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE)
charlotte.test <-
  charlotte.test %>%
  mutate(price.Predict = predict(reg.training, charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = abs(price.Predict - price) / price) %>% # APE relative to the observed price
  filter(price < 5000000)

MAE <- mean(charlotte.test$price.AbsError, na.rm = T)
MAPE <- mean(charlotte.test$price.APE, na.rm = T)

reg.MAE.MAPE <- 
  cbind(MAE, MAPE) %>%
  kable(caption = "Regression MAE & MAPE") %>%
  kable_styling("hover",full_width = F) 

reg.MAE.MAPE
Regression MAE & MAPE
MAE MAPE
107955.6 0.2289662

Cross Validation

#Cross-validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(CLT_internal) %>% 
                                dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, grocery_nn1,
                                               age, HH_inc, college_perc, 
                                               park_nn1), 
     method = "lm", trControl = fitControl, na.action = na.pass)
# Table
stargazer(as.data.frame(reg.cv$resample), type="text", digits=1, title="Cross Validation Results")
## 
## Cross Validation Results
## =======================================================
## Statistic  N    Mean    St. Dev.     Min        Max    
## -------------------------------------------------------
## RMSE      100 268,328.6 296,854.0 133,413.4 2,137,604.0
## Rsquared  100    0.6       0.2      0.01        0.8    
## MAE       100 114,163.2 17,868.1  93,420.4   207,393.9 
## -------------------------------------------------------
# Plot 
ggplot(reg.cv$resample, aes(x=MAE)) +
  geom_histogram(fill = "darkgreen") +
  labs(title = "Distribution of Mean Absolute Error Across Cross-Validation Folds") +
  xlab("MAE")+
  ylab("Count")+
  plotTheme()

Visualizing Prediction Errors

#Visualizing prediction errors
charlotte.APE <- charlotte.test[c(6,36,43:46,51,54)]

charlotte_APE.sf <- 
  charlotte.APE %>%
  filter(price.APE > 0) %>%
  st_as_sf(sf_column_name = "geometry") %>%
  st_transform('ESRI:103500')

grid.arrange(ncol=2,
             ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = charlotte_APE.sf, aes(color = q5(price.Predict)), size = .25) +
  scale_colour_manual(values = palette5,
                   labels=qBr(charlotte.APE,"price.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="Predicted Sales Price") +
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = charlotte_APE.sf, aes(color = price.APE), size = .25) +
  labs(title="Predicted Sales Price\nAbsolute Percent Error") +
  binned_scale(aesthetics = "color",
               scale_name = "stepsn",
               palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
               breaks = c(0.10, 0.20, 0.5, 0.75),
               limits = c(0, 50),
               show.limits = TRUE,
               guide = "colorsteps"
  ) +
  mapTheme())

#Predicted vs observed sales price
ggplot(
charlotte_APE.sf, aes(price, price.Predict, col = price.APE)) +
binned_scale(aesthetics = "color",
    scale_name = "stepsn", 
    palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
    breaks = c(0.10, 0.20, 0.5, 0.75),
    limits = c(0, 50),
    show.limits = TRUE, 
    guide = "colorsteps") +
geom_point(size=1) +
scale_y_continuous(limits = c(0, 4000000)) +
scale_x_continuous(limits = c(0, 4000000)) +
labs(title="Sales Price vs. Predicted", subtitle="Charlotte Metro Area") +
ylab("Predicted Sales Price (in dollars)") +
xlab("Observed Sales Price (in dollars)") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.5) +
labs(color = "Absolute % Error") +
geom_label(
  label="0% error line", 
  x=3500000,
  y=3000000,
  label.padding = unit(0.25, "lines"), # Rectangle size around label
  label.size = 0.15,
  color = "black",
  fill="grey80")

Moran’s I

By calculating Moran’s I for this dataset, we are determining whether there is spatial autocorrelation among our prediction errors for home sales in Charlotte. As seen in the outcome, Moran’s I was high, meaning there is a spatial relationship in the prediction errors that must be accounted for.
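For intuition, Moran’s I can be computed by hand on a toy example (illustration only; for the real test below we use spdep’s moran.mc with k-nearest-neighbor weights):

```r
# Hand-rolled Moran's I: I = (n / sum(W)) * sum(w_ij * z_i * z_j) / sum(z_i^2)
morans_i <- function(x, W) {
  n <- length(x)
  z <- x - mean(x)
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Four locations in a row; each neighbors the adjacent location(s)
W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

morans_i(c(1, 2, 3, 4), W)  # 0.333: positive, similar values sit next to each other
morans_i(c(1, 3, 2, 4), W)  # negative: high and low values alternate
```

A Moran’s I near zero would indicate spatially random errors; the positive value we observe means nearby homes share similar errors, which motivates the neighborhood adjustment below.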

#Calculating Moran's I (knn2nb, nb2listw, lag.listw, and moran.mc are from the spdep package)
library(spdep)
coords.test <- st_coordinates(charlotte.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
 
charlotte.test <- 
  charlotte.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE))
moranTest <- moran.mc(charlotte.test$price.Error, 
                      spatialWeights.test, nsim = 999, na.action = na.exclude, zero.policy = TRUE)

# Observed and permuted Moran's I 
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

Accounting for Neighborhood

We introduce neighborhood context into the model in an attempt to account for spatial bias in our predictions. Specifically, we include Census block group characteristics (median household income and the share of college-educated residents), as this information was readily available. Adding these variables increases the R-squared of our model, meaning it explains more of the observed variance with neighborhood context than without it.
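An alternative and more direct way to encode neighborhood is a fixed effect, i.e. including the neighborhood label itself as a factor. A minimal sketch, assuming `MIDD_NAME` holds the neighborhood name as it does elsewhere in this analysis:

```r
# Sketch: neighborhood fixed effects via a categorical dummy
# (assumes MIDD_NAME is the neighborhood label on the training set)
reg.fe <- lm(price ~ heatedarea + quality + age + MIDD_NAME,
             data = as.data.frame(charlotte.training))
```

Each level of `MIDD_NAME` then gets its own intercept, shifting predictions up or down by neighborhood.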

#Adjusting for neighborhood
reg.nhood <- lm(price ~ ., data = as.data.frame(charlotte.training) %>% 
                                 dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, 
                                               park_nn1, grocery_nn1,
                                               age, HH_inc, college_perc))
summary(reg.nhood)
## 
## Call:
## lm(formula = price ~ ., data = as.data.frame(charlotte.training) %>% 
##     dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat, 
##         NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age, 
##         HH_inc, college_perc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1588349   -84853   -10729    59231 43743238 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                      -260650.9662   40859.6608  -6.379
## heatedarea                           134.2995       4.5987  29.204
## quality                           194599.2921    5172.7628  37.620
## NUM_FLOORS.catUp to 2 Floors       48508.9091   22036.6104   2.201
## NUM_FLOORS.catUp to 3 Floors       -7429.1503   28550.7081  -0.260
## NUM_BEDS.catUp to 2 Bedrooms       46547.6155   15832.3511   2.940
## NUM_BEDS.catUp to 4 Bedrooms       36982.9267   10437.9678   3.543
## NUM_BATHS.catUp to 2.5 Bathrooms -430633.8824   25321.7332 -17.006
## NUM_BATHS.catUp to 4 Bathrooms   -436954.6889   23004.4329 -18.994
## park_nn1                              -8.4088       3.4074  -2.468
## grocery_nn1                          -20.9651       2.2827  -9.184
## age                                 2150.0810     123.2861  17.440
## HH_inc                                 0.7120       0.1182   6.022
## college_perc                       55718.5589   33367.8905   1.670
##                                              Pr(>|t|)    
## (Intercept)                            0.000000000181 ***
## heatedarea                       < 0.0000000000000002 ***
## quality                          < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors                 0.027725 *  
## NUM_FLOORS.catUp to 3 Floors                 0.794705    
## NUM_BEDS.catUp to 2 Bedrooms                 0.003285 ** 
## NUM_BEDS.catUp to 4 Bedrooms                 0.000396 ***
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms   < 0.0000000000000002 ***
## park_nn1                                     0.013600 *  
## grocery_nn1                      < 0.0000000000000002 ***
## age                              < 0.0000000000000002 ***
## HH_inc                                 0.000000001746 ***
## college_perc                                 0.094966 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 403900 on 25546 degrees of freedom
##   (2154 observations deleted due to missingness)
## Multiple R-squared:  0.3015, Adjusted R-squared:  0.3011 
## F-statistic: 848.1 on 13 and 25546 DF,  p-value: < 0.00000000000000022
charlotte.test.nhood <-
  charlotte.test %>%
  mutate(Regression = "Neighborhood Effects",
         price.Predict = predict(reg.nhood, charlotte.test),
         price.Error = price.Predict- price,
         price.AbsError = abs(price.Predict- price),
         price.APE = (abs(price.Predict- price)) / price)%>%
  filter(price < 5000000)

charlotte.test <-charlotte.test %>%
  mutate(Regression = "Baseline")

sales_predictions.sf <- CLT_internal %>%
  mutate(price.Predict = predict(reg.nhood, CLT_internal)) %>%
  filter(toPredict == "CHALLENGE")

sales_predictions.df <- as.data.frame(st_drop_geometry(sales_predictions.sf))
sales_predictions.df <- sales_predictions.df[c(30,43)]


write.csv(sales_predictions.df, "Quisqueyanes.csv", row.names = FALSE)

bothRegressions <- 
  rbind(
    dplyr::select(charlotte.test, starts_with("price"), Regression, MIDD_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)),
    dplyr::select(charlotte.test.nhood, starts_with("price"), Regression, MIDD_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)))    
#Neighborhood effect results
bothRegressions %>%
  dplyr::select(price.Predict, price, Regression) %>%
    ggplot(aes(price, price.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(price.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; green line represents our predictions") +
  plotTheme() +
  theme(axis.text.x = element_text(margin = margin(3, 3, 3, 3)))

While controlling for neighborhood context improved our model, the difference is small and difficult to see in the plot above.

#Scatter plot of MAPE by neighborhood mean price
npa.mean.sf <- charlotte.test %>%
  drop_na(price.APE) %>%
  group_by(MIDD_NAME) %>%
    summarise(mean_APE = mean(price.APE))

npa.price.sf <- charlotte.test %>%
  drop_na(price.APE) %>%
  group_by(MIDD_NAME) %>%
    summarise(mean_Price = mean(price.Predict))

MAPE_by_NPA <- merge(st_drop_geometry(npa.mean.sf), npa.price.sf, by="MIDD_NAME")

grid.arrange(ncol=2,
ggplot(MAPE_by_NPA, aes(mean_Price, mean_APE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  labs(title = "MAPE by Neighborhood Mean Sales Price",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme(),

ggplot(npa.mean.sf, aes(x=MIDD_NAME, y=mean_APE)) +
geom_point(alpha=0.5) +
labs(title="Prediction Error by Neighborhood", subtitle="Charlotte Area Home Sales") +
ylab("Mean Absolute % Error") +
 xlab("Neighborhood") +
 theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))) 

st_drop_geometry(charlotte.test) %>%
  group_by(MIDD_NAME) %>%
  summarize(MAPE = mean(price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(CLT_neighborhoods) %>%
    st_sf() %>%
    ggplot() + 
   geom_sf(aes(fill = MAPE), color=NA) +
      scale_fill_gradient(low = palette5[5], high = palette5[1],
                          name = "MAPE") +
  geom_sf(data = charlotte.test, colour = "black", show.legend = "point", size = .05) +
      labs(title = "Mean test set MAPE by Neighborhood", subtitle="Charlotte Metro Area") +
      mapTheme()

TidyCensus: Evaluating Generalizability #2

We split the Charlotte Metro Area into two groups, “majority white” and “majority non-white,” to test our model’s generalizability. White non-Hispanic residents currently represent 45.3% of Mecklenburg County’s population, so it is worthwhile to check whether the accuracy of our predictions depends on demographic context. (Source)

For this test, “majority white” is defined as a census block group where more than 50% of the total population identifies as white.

# Adding race data
Race <- 
  st_read("Population.geojson") %>%
  select(c(9,14)) %>%
  st_transform(st_crs(CLT_internal))
## Reading layer `Census_Population_Block_Groups' from data source 
##   `/Users/kemirichards/Documents/GitHub/Zestimate-Project/Population.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 555 features and 42 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -81.05803 ymin: 35.00171 xmax: -80.5503 ymax: 35.5151
## Geodetic CRS:  WGS 84
# Remove empty block groups and NAs
Race <- filter(Race, `Population` != 0) %>%
  na.omit()

# Calculate percentage of white population
Race <- Race %>%
  mutate(PctWhite = ((`White`/`Population`)*100))

# Creating majority white column
Race <- Race %>%
  mutate(Race = ifelse(PctWhite > 50, "Majority White", "Majority Non-White")) 

# Plot 
ggplot() + geom_sf(data = na.omit(Race), aes(fill = `Race`)) +
    scale_fill_manual(values = c("#FA7800", "honeydew3"), name="Race Context") +
    labs(title = "Race in Mecklenburg County, NC", 
         subtitle = "Charlotte Metro Area") +
    mapTheme() + theme(legend.position="bottom")

Now, let’s check whether our model’s errors and predictions were consistent across these demographic contexts:
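The figures reported below were tabulated separately; a sketch of how they could be computed in one pipeline, assuming the test-set sales are spatially joined to the `Race` polygons first, is:

```r
# Sketch (hypothetical join): MAPE and mean predicted price by race context
st_join(charlotte.test.nhood, Race) %>%
  st_drop_geometry() %>%
  group_by(Race) %>%
  summarize(MAPE       = mean(price.APE, na.rm = TRUE),
            mean_Price = mean(price.Predict, na.rm = TRUE))
```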

# MAPE by race
Variable <- c("Majority Non-White", "Majority White")
Result <- c("30%", "25%")
MAPE_race <- data.frame(Variable, Result)

# Table 
kable(MAPE_race, caption = "MAPE by Race") %>%
    kable_styling("striped",full_width = F) 
MAPE by Race
Variable            Result
Majority Non-White  30%
Majority White      25%
# Mean price by race
Variable <- c("Majority Non-White", "Majority White")
Result <- c("$305,016", "$508,226.50")
PRICE_race <- data.frame(Variable, Result)

# Table
kable(PRICE_race, caption = "Mean Predicted Price by Race") %>% 
  kable_styling("striped",full_width = F) 
Mean Predicted Price by Race
Variable            Result
Majority Non-White  $305,016.00
Majority White      $508,226.50

The tables above show that our model’s errors were inconsistent across demographic contexts. In particular, our model experienced a greater MAPE within Majority Non-White areas: a MAPE of 30% indicates that, on average, our predicted home price was 30% away from the actual value, a rate of error 5 percentage points higher than in Majority White neighborhoods. Likewise, our model’s mean predicted price is highest among Majority White neighborhoods, indicating a bias against Majority Non-White areas. For this reason, our model is not generalizable.

Conclusion

Results

Based upon our R-squared, our variables explain roughly half of the variation observed in Charlotte home prices. Our mean absolute error (MAE) is 106,593.9, indicating that, on average, our model’s price predictions differ from actual home prices by roughly $106,594. This is fair, and may be driven by outliers (very costly homes) included in our dataset. However, when evaluated within a racial context, the model does not perform consistently: it seemingly undervalues homes in majority non-white neighborhoods (based upon average predicted price) and experiences a greater mean absolute percentage error (MAPE) there than in majority white areas. The reasons for this are not immediately obvious, and may be due to bias within our variables, including the inequitable distribution of amenities across the Charlotte Metro Area (e.g. parks, schools, grocery stores) as well as historical disparities in median household income and educational attainment (our “spatial structure” variables). These features all factor into the viability and price of a home sale.
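The headline figures above are simple summaries of the test-set errors, e.g.:

```r
# Sketch: headline test-set accuracy metrics
mean(charlotte.test.nhood$price.AbsError, na.rm = TRUE)  # MAE (in dollars)
mean(charlotte.test.nhood$price.APE,      na.rm = TRUE)  # MAPE (as a proportion)
```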

Discussion

The inconsistency of our model’s performance within a racial context underscores the challenge of creating algorithms that are fair to all. Algorithms are not magic; they simply draw on historical data in an attempt to forecast the future. Unfortunately, real-life discrimination and disparity are embedded within this data, and an algorithm can easily exacerbate this bias if its primary objective is to increase ‘efficiency’ and greater care isn’t taken to account for these social ills (perhaps by adding weights, omitting discriminatory variables, enforcing demographic parity, etc.).

Ultimately, it would not be socially responsible or ethically sound to publish an algorithm with easily identifiable biases. So, this means we’ll need to postpone our Zillow pitch meeting for now :)

Future iterations of this model should include nearby sales data, as it is a strong predictor of home valuation (i.e. the sale price of your neighbor’s home is likely very similar to the sale price of your own). For this project, we were not allowed to use this information, which added a level of difficulty that we tried to make up for by emphasizing other variables such as the number of floors, bedrooms, and bathrooms, housing quality, the quality of nearby schools, and a home’s proximity to grocery stores. And of course, this model can be strengthened by accounting for biases in the datasets used in the model-building process.
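Such a spatial-lag price feature could be built with the same spdep machinery used earlier for Moran’s I. A sketch, assuming `charlotte.sf` is an sf point layer of sales with a `price` column:

```r
# Sketch: average sale price of the 5 nearest neighbors as a predictor
library(sf)
library(spdep)

coords  <- st_coordinates(charlotte.sf)
knn5    <- knn2nb(knearneigh(coords, k = 5))
weights <- nb2listw(knn5, style = "W")   # row-standardized weights
charlotte.sf$lagPrice <- lag.listw(weights, charlotte.sf$price)
```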